Tabi: An Efficient Multi-Level Inference System for Large Language Models | Proceedings of the Eighteenth European Conference on Computer Systems

Small-model-assisted inference; read, notes not yet organized.


[2311.15566] SpotServe: Serving Generative Large Language Models on Preemptible Instances

LLM inference on preemptible instances; may have points of synergy with LoongServe.


Two works from Meta at OSDI

MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale | USENIX

Scheduler work; reportedly of very high quality.

Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences | USENIX

Resource allocation work.


Mooncake: Kimi’s KVCache-centric Architecture for LLM Serving

Work from Moonshot AI.

Mooncake (4): How the mooncake's crust and filling are made — open-sourcing the Mooncake Transfer Engine and follow-up plans - Zhihu

Optimizations in the transfer layer.


Xiaodong Wang

Market theory

Advanced Microeconomics series: General Equilibrium

Bayesian optimization


DLRover

Analysis of the Ant Group paper.


[2401.11181] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

An optimization targeting different kinds of requests (disaggregation for mixed downstream workloads).


Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Alibaba's long-context work.


[2406.17565] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Huawei's prefill/decode (PD) disaggregation work.


[2405.07719] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

Jiarui Fang's work on long-context inference; includes a bandwidth-requirement analysis.


hao-ai-lab/vllm-ltr: [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank

Predicts when LLM requests will finish, and schedules accordingly.


penghuima/awesome-serverless-papers: Collect papers about serverless computing research

Collection of serverless computing papers.

Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters | IEEE Journals & Magazine | IEEE Xplore

Paper from Xin Jin's group.


Paper and code reading notes

A PhD student in Xin Jin's group; their personal homepage has many related paper-reading notes.


OSDI 2024 reading commentary series (Part 1)

OSDI 2024 reading commentary series (Part 2)

Session 4 DL

Session 6 Cloud Computing

OSDI 2024 reading commentary series (Part 3)

Session 11 ML Scheduling

[2406.19707] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management


Linear algebra

MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018 - YouTube


[2412.03213] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Follow-up work to InfiniGen.


[Original long-form post] 2024.10 — The state of open-source LLM inference engines and common inference optimization techniques - Zhihu


Long-context sparsification works

DuoAttention

[2410.05076] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

[2407.02490] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

NeurIPS 2024; essentially a "Pro Max" version of Star Attention.


SOSP 2024 reading commentary series (Part 1)

SOSP 2024 reading commentary series (Part 2)

Session 3 Deep Learning and Training

SOSP 2024 reading commentary series (Part 3)

Session 6 Serverless

SOSP 2024 reading commentary series (Part 4)

Session 9 ML Serving

